Declarative checkpoint config conversion (Llama pilot) #508
Open
jlamypoirier wants to merge 11 commits into main from
Conversation
Eight config fields whose values directly affect model architecture were previously hinted feature, core, or left unhinted; they are now tagged FieldHint.architecture. They drive the upcoming declarative-converter coverage check, which uses FieldHint.architecture as the source of truth for "must be handled by every checkpoint format":

- AttentionConfig.dense_layer (output projection presence)
- AttentionConfig.softmax_scale_power (attention scaling)
- MLPConfig.activation (forward-pass activation type)
- MoEMLPConfig.router (routing weights drive token assignment)
- Llama3RotaryConfig: scale_factor, low_frequency_factor, high_frequency_factor, original_context_length
- YarnRotaryConfig: scale_factor, attention_factor, beta_fast, beta_slow, original_context_length
- StochasticMixerConfig.main_mixer_name (selects the inference mixer)
- PatchEmbeddingsConfig.patch_height / patch_width (input tokenization)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
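A standalone sketch of how an architecture hint can drive that enumeration (not Fast-LLM's actual `Field`/`FieldHint` API; the dataclass metadata layout and the defaults below are assumptions):

```python
# Minimal standalone sketch: an `architecture` hint lets a coverage check
# enumerate the fields every checkpoint format must explicitly handle.
import dataclasses
import enum


class FieldHint(enum.Enum):
    feature = "feature"
    core = "core"
    architecture = "architecture"  # value changes the model's architecture


@dataclasses.dataclass
class AttentionConfig:
    # Hypothetical metadata layout and defaults; Fast-LLM's real Field API differs.
    softmax_scale_power: float = dataclasses.field(
        default=0.5, metadata={"hint": FieldHint.architecture}
    )
    dense_layer: bool = dataclasses.field(
        default=True, metadata={"hint": FieldHint.architecture}
    )
    dropout: float = dataclasses.field(default=0.0, metadata={"hint": FieldHint.feature})


def architecture_fields(config_cls: type) -> set[str]:
    """Fields a checkpoint converter must account for."""
    return {
        f.name
        for f in dataclasses.fields(config_cls)
        if f.metadata.get("hint") is FieldHint.architecture
    }


assert architecture_fields(AttentionConfig) == {"softmax_scale_power", "dense_layer"}
```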
Reintroduces the declarative config-conversion shape that pre-dated PR #362, applied within the post-#362 modular per-section structure. Replaces the imperative import_config/export_config bodies with a small set of named primitives and a recursive walker driven by per-section declarations.

Primitives in fast_llm.engine.checkpoint.external:
- RenameConfigConverter — 1:1 path rename
- ConstantExportConfigConverter — write constant on export, assert on import
- ConstantImportConfigConverter — assert on export, inject on import
- DefaultConfigConverter — rename with HF-side fallback
- OptionalConfigConverter — emit/import only when non-sentinel
- IgnoredConfigConverter — declare a field as intentionally not converted
- CustomConfigConverter — escape hatch for cross-field transforms
- NestedConfigConverter — recurse into a fixed-typed sub-config; flat-merges HF output into the parent (transformer side is assumed flat)
- DispatchConfigConverter — runtime type dispatch for polymorphic sub-configs

ConfigSectionConverter is the per-Fast-LLM-class converter base. Subclasses declare their conversion via _create_config_converters() and inherit import_config/export_config concretely.

The architecture-coverage check fires only when type(config) exactly matches the converter's declared fast_llm_config_class — strict subclass types defer to a more specific converter, allowing yet-to-be-migrated subclasses (e.g., Mixtral on Llama) to call super().export_config() without tripping the parent's check on fields the parent doesn't know about.

The walker is implicit: NestedConfigConverter / DispatchConfigConverter call the public import_config/export_config on the sub-converter class so subclass overrides participate, rather than a private path that bypasses them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
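A minimal standalone sketch of that shape — the class names mirror the PR's primitives, but the signatures and the dict-based configs are assumptions, not the actual Fast-LLM API:

```python
# Standalone sketch; configs are plain dicts here, not Fast-LLM Config objects.
import abc
import typing


def _get(config: dict, path: tuple[str, ...]):
    for key in path:
        config = config[key]
    return config


def _set(config: dict, path: tuple[str, ...], value) -> None:
    for key in path[:-1]:
        config = config.setdefault(key, {})
    config[path[-1]] = value


class RenameConfigConverter:
    """1:1 path rename between a Fast-LLM field and an HF config key."""

    def __init__(self, fast_llm_path: tuple[str, ...], hf_path: tuple[str, ...]):
        self.fast_llm_path = fast_llm_path
        self.hf_path = hf_path

    def export_to(self, fast_llm_config: dict, hf_config: dict) -> None:
        _set(hf_config, self.hf_path, _get(fast_llm_config, self.fast_llm_path))

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        _set(fast_llm_config, self.fast_llm_path, _get(hf_config, self.hf_path))


class ConfigSectionConverter(abc.ABC):
    """Per-Fast-LLM-class converter: subclasses only declare their converters."""

    fast_llm_config_class: typing.ClassVar[type]

    @classmethod
    @abc.abstractmethod
    def _create_config_converters(cls) -> list:
        ...

    @classmethod
    def export_config(cls, fast_llm_config: dict) -> dict:
        hf_config: dict = {}
        for converter in cls._create_config_converters():
            converter.export_to(fast_llm_config, hf_config)
        return hf_config

    @classmethod
    def import_config(cls, hf_config: dict) -> dict:
        fast_llm_config: dict = {}
        for converter in cls._create_config_converters():
            converter.import_to(hf_config, fast_llm_config)
        return fast_llm_config
```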
Pilot of the new ConfigSectionConverter framework. Each Llama section converter (Normalization/MLP/Attention/Block/Embeddings/Head/BaseModel) now declares its conversion via _create_config_converters() instead of imperative import_config/export_config bodies. Weight side is unchanged.

Notable shape decisions:
- LlamaDecoderConverter stays as a regular (imperative) class because Fixed/Pattern block-sequence dispatch doesn't lend itself to the declarative shape. LlamaBaseModelConverter wires it in via a small CustomConfigConverter; subclasses (Mistral, Qwen2, MTP-Llama, ...) continue to plug in different block converters via block_converter_class.
- _check_config is retained as an overridable classmethod and called from the linear_layers CustomConfigConverter, so Qwen2 can keep its asymmetric Q/K/V bias rule without re-implementing the export.
- IgnoredConfigConverter is used for ParameterConfig sub-fields with no architecture-significant content (weight, output_weight, word_embeddings), and for prediction_heads (which Llama HF doesn't expose; subclass MTP-Llama adds it imperatively).
- peft uses CustomConfigConverter to assert NoPeftConfig on export. Llama HF format cannot represent PEFT, so a configured LoRA now fails loudly rather than being silently dropped.
- Rotary remains in CustomConfigConverter — the v4/v5 transformers split (rope_theta/rope_scaling vs. rope_parameters) and three rope_type variants don't fit pure rename primitives.

Verified with live round-trips of Llama-3, Qwen2, Mistral, Mixtral, and MTP-Llama HF configs, plus tests/models/test_checkpoint.py for all GPT formats (139 passed, 0 failed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
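Building on the standalone sketch under the previous commit note, a hypothetical MLP section converter reduces to a declaration list; the Fast-LLM and HF field paths are illustrative, not the PR's exact ones:

```python
# Reuses RenameConfigConverter / ConfigSectionConverter from the sketch above.
class LlamaMLPConverter(ConfigSectionConverter):
    fast_llm_config_class = dict  # stands in for MLPConfig in this sketch

    @classmethod
    def _create_config_converters(cls) -> list:
        return [
            RenameConfigConverter(("intermediate_size",), ("intermediate_size",)),
            RenameConfigConverter(("activation",), ("hidden_act",)),
        ]


hf = LlamaMLPConverter.export_config({"intermediate_size": 11008, "activation": "silu"})
assert hf == {"intermediate_size": 11008, "hidden_act": "silu"}
assert LlamaMLPConverter.import_config(hf) == {"intermediate_size": 11008, "activation": "silu"}
```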
Adds `_validate_export(cls, config)` classmethod hook on `ConfigSectionConverter`, called automatically from `export_config` after the architecture-coverage check. Replaces five `CustomConfigConverter`-as-validator blocks (`linear_layers`/`layers` in attention and MLP, `position_embeddings` in embeddings, `peft` in base model, plus the `_check_config` chain on attention) with `IgnoredConfigConverter` for field-claiming + small `_validate_export` overrides. Mistral and Qwen2 rename their `_check_config` overrides accordingly; Pixtral's imperative export updates its `cls._check_config(config)` call site.

Also addresses several reviewer-flagged correctness/cleanup items:
- Drop the half-removed `parent_context` parameter from every primitive's `import_to` signature (and from `CustomConfigConverter`'s `import_fn`). It was unreachable through the walker.
- `_check_architecture_coverage` now reads `cls.fast_llm_config_class` directly instead of `getattr(..., None)`, surfacing missing class-attribute declarations as `AttributeError` rather than silently disabling the safety net.
- Drop the unused `hf_paths` parameter from `CustomConfigConverter.__init__`. There is no symmetric HF-side coverage check yet, so the field was cosmetic.
- Add a TODO note in `_check_architecture_coverage` documenting that the `MoEMLPConfig`/`MambaConfig`/etc. safety net is gated on later migrations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
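A standalone sketch of the hook's shape, with the coverage check and the declared converters stubbed out; the Qwen2 field names below are illustrative, not the real config fields:

```python
# Standalone sketch of the _validate_export hook (signatures assumed from this
# PR's description): export runs the coverage check, then the per-class validator.
class ConfigSectionConverter:
    @classmethod
    def export_config(cls, config: dict) -> dict:
        cls._check_architecture_coverage(config)
        cls._validate_export(config)
        return cls._export(config)

    @classmethod
    def _check_architecture_coverage(cls, config: dict) -> None:
        pass  # stub; the real check walks FieldHint.architecture fields

    @classmethod
    def _validate_export(cls, config: dict) -> None:
        pass  # default: nothing extra to check

    @classmethod
    def _export(cls, config: dict) -> dict:
        return dict(config)  # stub for the declared converters


class Qwen2AttentionConverter(ConfigSectionConverter):
    @classmethod
    def _validate_export(cls, config: dict) -> None:
        # Illustrative stand-in for Qwen2's asymmetric bias rule:
        # Q/K/V biases enabled, dense (output projection) bias disabled.
        if not (config["qkv_bias"] and not config["dense_bias"]):
            raise ValueError("Qwen2 HF format requires Q/K/V bias and no dense bias.")


Qwen2AttentionConverter.export_config({"qkv_bias": True, "dense_bias": False})
```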
The dict of named per-block configs is unambiguously architecture metadata; without an explicit hint it defaulted to `unknown`, hiding it from the architecture-coverage check used by declarative checkpoint converters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two additions, both required by Apriel2's nested HF schema:
- `NestedConfigConverter` gains an optional `hf_path` kwarg. When set, the sub-converter's output is placed under that nested key instead of being flat-merged. Existing flat-merge behavior is unchanged when `hf_path` is omitted.
- New `TypedDictContainerConfigConverter` for `dict[str, Config]` fields where each entry is round-tripped through a per-class section converter. Polymorphic dispatch via the entry's runtime type on export and the HF discriminator on import. A homogeneous mode (single registered class with `hf_type_name = None`) skips the discriminator entirely.

Both `DispatchConfigConverter` and `TypedDictContainerConfigConverter` now also inject the Fast-LLM `dynamic_type_name` discriminator into the imported sub-dict so the parent's `from_dict` dispatches to the right `Config` subclass without a separate ConstantImport.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
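A standalone, dict-based sketch of the container's semantics — the real converter operates on typed `Config` objects and per-class section converters, so the names and shapes here are assumptions:

```python
# Standalone sketch: each dict entry round-trips through a per-class converter,
# dispatched by a discriminator on both sides. The dispatch keys on the dynamic
# type name instead of the Python class purely to keep the sketch dict-based.
class AttentionMixerConverter:
    hf_type_name = "attention"

    @staticmethod
    def export_config(cfg: dict) -> dict:
        return {"heads": cfg["heads"]}

    @staticmethod
    def import_config(hf: dict) -> dict:
        return {"type": "attention", "heads": hf["heads"]}  # inject discriminator


class MambaMixerConverter:
    hf_type_name = "mamba"

    @staticmethod
    def export_config(cfg: dict) -> dict:
        return {"state_size": cfg["state_size"]}

    @staticmethod
    def import_config(hf: dict) -> dict:
        return {"type": "mamba", "state_size": hf["state_size"]}


class TypedDictContainerConfigConverter:
    def __init__(self, registry: dict[str, type]):
        self._registry = registry  # keyed by the Fast-LLM dynamic type name

    def export(self, entries: dict[str, dict]) -> dict:
        out = {}
        for name, cfg in entries.items():
            conv = self._registry[cfg["type"]]
            out[name] = {"type": conv.hf_type_name, **conv.export_config(cfg)}
        return out

    def import_(self, hf_entries: dict[str, dict]) -> dict:
        by_hf_name = {c.hf_type_name: c for c in self._registry.values()}
        return {name: by_hf_name[hf["type"]].import_config(hf) for name, hf in hf_entries.items()}


mixers = TypedDictContainerConfigConverter(
    {"attention": AttentionMixerConverter, "mamba": MambaMixerConverter}
)
hf = mixers.export({"attn": {"type": "attention", "heads": 32},
                    "ssm": {"type": "mamba", "state_size": 16}})
assert mixers.import_(hf)["ssm"]["type"] == "mamba"
```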
Stress-tests the framework's polymorphic dispatch and typed-dict
support: Apriel2's HF schema is nested (`decoder.block.mixer.{...}`,
`head.normalization`, `mixers.{name}`) and the mixer field is
heterogeneously polymorphic (Attention/Mamba/StochasticMixer/GDN/KDA).
Migrated converters: per-mixer (Attention/Mamba/GDN/KDA), the
StochasticMixer container (driven by TypedDictContainer over a
leaf-mixer registry), per-normalization (RMS/LayerNorm/NoNorm), MLP,
Block, Fixed/Pattern decoder variants (selected by Dispatch on
runtime BlockSequenceConfig type), Head, and BaseModel.
The imperative weight-side `get_converters` methods are preserved
unchanged so the multimodal Apriel2 converter (which inherits from
the text-only one) keeps working without modification.
PatternDecoder's `blocks` dict uses the homogeneous mode of
TypedDictContainer (single-class registry, no discriminator). The
attention rotary-type translation (default ↔ mistral_1d) and Mamba's
auxiliary HF fields (d_conv, conv_bias, dt_proj_bias derived from
linear-config bias flags) remain on `CustomConfigConverter` since
they're shape-changing transforms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
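A standalone sketch of the runtime-type dispatch used for the Fixed/Pattern decoder split; the HF keys are hypothetical and the registry holds plain callables rather than section-converter classes:

```python
# Standalone sketch of DispatchConfigConverter-style runtime-type dispatch.
import dataclasses
from collections.abc import Callable


@dataclasses.dataclass
class FixedBlockSequenceConfig:
    num_blocks: int


@dataclasses.dataclass
class PatternBlockSequenceConfig:
    pattern: list[str]


class DispatchConfigConverter:
    """Pick a sub-converter based on the exact runtime type of the sub-config."""

    def __init__(self, registry: dict[type, Callable[[object], dict]]):
        self._registry = registry

    def export(self, config: object) -> dict:
        return self._registry[type(config)](config)


# Hypothetical HF keys, purely for illustration.
decoder_dispatch = DispatchConfigConverter({
    FixedBlockSequenceConfig: lambda c: {"num_hidden_layers": c.num_blocks},
    PatternBlockSequenceConfig: lambda c: {"block_pattern": c.pattern},
})
assert decoder_dispatch.export(PatternBlockSequenceConfig(["attn", "mamba"])) == {
    "block_pattern": ["attn", "mamba"]
}
```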
…primitives
Each format inherits Llama's `_create_config_converters` and replaces only the
fields that diverge:
* Mistral: ConstantImportConfigConverter pinning `add_linear_biases=False` for
attention and MLP (HF format has no `attention_bias`/`mlp_bias`); rename
`window_size` <-> `sliding_window`.
* Qwen2: ConstantImportConfigConverter for `add_linear_biases`; CustomConfigConverter
for `head_size` (no `head_dim` HF field, derive on import); CustomConfigConverter
for per-layer biases (always Q/K/V=True, dense=False); the head_dim relationship
`heads * head_size == hidden_size` moves to `_validate_export` on the base-model
converter; the use_mrope guard moves to `import_config`.
* MTP-Llama: RenameConfigConverter for `prediction_heads` (Llama blanket-ignores it).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
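A standalone sketch of the inherit-and-replace shape; the converters are collapsed to labelled tuples keyed by Fast-LLM field purely for illustration, and the PR's actual `_create_config_converters` return structure may differ:

```python
# Standalone sketch: a format subclass replaces only the diverging declarations.
class LlamaAttentionConverter:
    @classmethod
    def _create_config_converters(cls) -> dict[str, tuple]:
        return {
            "head_size": ("rename", "head_dim"),
            "window_size": ("ignored", None),
            "add_linear_biases": ("rename", "attention_bias"),
        }


class MistralAttentionConverter(LlamaAttentionConverter):
    @classmethod
    def _create_config_converters(cls) -> dict[str, tuple]:
        converters = super()._create_config_converters()
        # Mistral HF configs carry no `attention_bias`: pin False on import instead.
        converters["add_linear_biases"] = ("constant_import", False)
        # Sliding windows are representable, so `window_size` becomes a plain rename.
        converters["window_size"] = ("rename", "sliding_window")
        return converters


assert MistralAttentionConverter._create_config_converters()["window_size"] == (
    "rename",
    "sliding_window",
)
```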
`MixtralMLPConverter` switches its `fast_llm_config_class` to `MoEMLPConfig` so the
architecture-coverage check sees MoE-specific fields. The config-side overrides:
* `add_linear_biases` -> ConstantImportConfigConverter (Mixtral has no `mlp_bias`).
* `experts` <-> `num_local_experts` and `experts_per_token` <-> `num_experts_per_tok`
via RenameConfigConverter.
* `shared_experts=0` and `routing=topk` pinned via ConstantImportConfigConverter so
they round-trip cleanly without an HF representation.
* `router` covered by IgnoredConfigConverter (Mixtral's gate is a default `LinearConfig`).
The Fast-LLM dynamic-type discriminator (`type: "moe"`) is injected via an `import_config`
override since the MLP is wrapped via `NestedConfigConverter` rather than `DispatchConfigConverter`.
Diffusion-Dream and Diffusion-Llama need no migration: they only override `architecture`,
`get_transformers_configuration_class`, and `_export_config` (auto_map). They inherit the
declarative converters from their parents (Qwen2 and Llama).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
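A standalone sketch of the `ConstantImportConfigConverter` semantics relied on above (assert on export, inject on import, no HF key written); the constructor signature is assumed:

```python
# Standalone sketch of the "pin a value that has no HF representation" primitive.
class ConstantImportConfigConverter:
    def __init__(self, fast_llm_field: str, value):
        self.fast_llm_field = fast_llm_field
        self.value = value

    def export_to(self, fast_llm_config: dict, hf_config: dict) -> None:
        # Nothing is written to the HF config; the value must match to be exportable.
        if fast_llm_config[self.fast_llm_field] != self.value:
            raise ValueError(
                f"{self.fast_llm_field}={fast_llm_config[self.fast_llm_field]!r} "
                f"cannot be represented in this HF format (expected {self.value!r})."
            )

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        fast_llm_config[self.fast_llm_field] = self.value


imported: dict = {}
ConstantImportConfigConverter("shared_experts", 0).import_to({}, imported)
ConstantImportConfigConverter("routing", "topk").import_to({}, imported)
assert imported == {"shared_experts": 0, "routing": "topk"}
```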
…itives

`AprielMambaConverter`, `GatedDeltaNetConverter`, and `KimiDeltaAttentionConverter` become `ConfigSectionConverter` subclasses with their HF-side fields nested under the appropriate HF subkey (`ssm_cfg` for Mamba, `linear_attn_config` for GDN/KDA).

Mamba's three sibling-default fields (`d_inner`, `d_xb`, `dt_rank`) read the HF root's `hidden_size` directly via `DefaultConfigConverter.hf_default_fn` / `CustomConfigConverter`, removing the need for explicit `parent_context` plumbing through the framework. The per-layer convolution and dt biases use `CustomConfigConverter` to pick up the mixer-wide `add_linear_biases` fallback when unset; the existing `_check_config` per-layer assertions move to `_validate_export`.

`AprielBlockConverter` (the per-block dispatcher) and `AprielDecoderConverter` (the `hybrid_block_layout` driver) stay imperative because Apriel's HF format encodes the mixer type in a parent-level list rather than a per-block discriminator, which `DispatchConfigConverter` doesn't model. The `type: "mamba"`/`"gdn"`/`"kda"` Fast-LLM discriminator is injected via a one-line `import_config` override on each leaf converter (same pattern Mixtral uses).

The HF format has no test coverage in `tests/models/test_checkpoint.py` or `tests/models/test_hf_roundtrip.py`, so verification was a synthesized live round-trip covering each mixer leaf plus a hybrid attention+Mamba pattern decoder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
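A standalone sketch of the `hf_default_fn` idea (signature assumed; the 2× `hidden_size` fallback is purely illustrative, not Mamba's actual default):

```python
# Standalone sketch: when the HF key is absent, derive the value from the HF root
# config instead of threading a parent_context through the walker.
class DefaultConfigConverter:
    def __init__(self, fast_llm_field: str, hf_path: tuple[str, ...], hf_default_fn):
        self.fast_llm_field = fast_llm_field
        self.hf_path = hf_path
        self.hf_default_fn = hf_default_fn

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        node = hf_config
        for key in self.hf_path[:-1]:
            node = node.get(key, {})
        value = node.get(self.hf_path[-1])
        if value is None:
            value = self.hf_default_fn(hf_config)  # fallback reads the HF root
        fast_llm_config[self.fast_llm_field] = value


d_inner = DefaultConfigConverter(
    "d_inner", ("ssm_cfg", "d_inner"), lambda hf: 2 * hf["hidden_size"]
)
imported: dict = {}
d_inner.import_to({"hidden_size": 4096, "ssm_cfg": {}}, imported)
assert imported == {"d_inner": 8192}
```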
…larative primitives

`PixtralNormalizationConverter` collapses to a single `_create_config_converters` override that pins `epsilon=1e-5` via `ConstantImportConfigConverter` (asserts on export, injects on import; no HF write).

`PixtralEmbeddingsConverter` becomes a `ConfigSectionConverter` with declarations for `patch_height` (rename to `patch_size`), `patch_width` (mirror `patch_size` on import), `num_channels` (export-only constant 3), nested `normalization`, and an `IgnoredConfigConverter` for `patch_embeddings`. The `patch_height == patch_width` and `patch_embeddings.bias.enabled in (None, False)` checks move to `_validate_export`.

The remaining Llava and Apriel2 multimodal converters stay imperative: they're cross-section aggregators (vision_config + text_config + top-level merge) whose shape doesn't fit a single ConfigSectionConverter, often with parent-context dependencies (e.g., the adapter's intermediate_size derives from the text model's hidden_size).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
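A standalone sketch of the `ConstantExportConfigConverter` semantics used for `num_channels` (constructor signature assumed):

```python
# Standalone sketch: write a constant HF key on export, assert it on import.
class ConstantExportConfigConverter:
    def __init__(self, hf_key: str, value):
        self.hf_key = hf_key
        self.value = value

    def export_to(self, fast_llm_config: dict, hf_config: dict) -> None:
        # Nothing is read from the Fast-LLM side; the HF key is always emitted.
        hf_config[self.hf_key] = self.value

    def import_to(self, hf_config: dict, fast_llm_config: dict) -> None:
        # On import the key must either be absent or match the pinned constant.
        if hf_config.get(self.hf_key, self.value) != self.value:
            raise ValueError(f"Unsupported {self.hf_key}={hf_config[self.hf_key]!r}")


hf_config: dict = {}
ConstantExportConfigConverter("num_channels", 3).export_to({}, hf_config)
assert hf_config == {"num_channels": 3}
```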
Summary
First step of the conversion-simplification refactor. Reintroduces declarative config-conversion primitives, applied within the post-#362 modular per-section structure, and migrates Llama as the pilot to validate the design.
Three sequential commits:
1. `FieldHint.architecture` — eight fields (attention `dense_layer` / `softmax_scale_power`, MLP `activation`, MoE `router`, four Llama3 / five Yarn rotary scaling fields, StochasticMixer `main_mixer_name`, vision patch height/width). These drive the new coverage check.
2. `ConfigConverter` primitives and section-converter ABC in `fast_llm/engine/checkpoint/external.py`. Nine primitives (Rename, ConstantExport, ConstantImport, Default, Optional, Ignored, Custom, Nested, Dispatch) plus `ConfigSectionConverter`. The walker is implicit — `NestedConfigConverter` and `DispatchConfigConverter` call the public `import_config`/`export_config` so subclass overrides participate. The coverage check fires only when `type(config)` exactly matches the converter's declared `fast_llm_config_class`, so unmigrated subclasses (Mixtral on Llama, Qwen2's `_check_config` override, etc.) keep working through `super()`.
3. Llama pilot migration. `LlamaDecoderConverter` stays imperative (Fixed/Pattern block-sequence dispatch doesn't fit cleanly). `_check_config` is retained as an overridable hook. PEFT non-default values now fail loudly on export instead of being silently dropped.

Notable shape decisions (open to course-correction)
- The coverage check is exact-type (`type(config) is cls.fast_llm_config_class`); strict subclasses defer to a more specific converter (see the sketch after this list). This was needed to keep Mixtral working through `super().export_config()` on `MoEMLPConfig` while only Llama is migrated.
- `NestedConfigConverter` is flat-merge only. The transformer side is assumed flat. Non-flat HF cases (Apriel2 mixers) will use `DispatchConfigConverter` with an `hf_path`, or `CustomConfigConverter`.
- Sub-configs are declared as `NestedConfigConverter(field, converter_class)` for fixed types and `DispatchConfigConverter(field, registry)` for polymorphic ones. Subclasses override sub-converter classes the same way as today's `ClassVar[type]` pattern.
- `parent_context` plumbing is dropped for now (it was speculative and unused in Llama). It will be re-introduced as an explicit kwarg when the Apriel migration needs it for mamba sibling-field defaults.
- `IgnoredConfigConverter` is permissive — it silently passes architecture fields through without a check. It is used for ParameterConfig sub-fields (init/lr_scale only, no architecture sub-fields) and for fields where the Llama HF format genuinely has no representation. PEFT (which IS architecture-significant when configured) uses `CustomConfigConverter` with an explicit `Assert.custom(isinstance, config.peft, NoPeftConfig)` instead.
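A minimal standalone sketch of the exact-type gate from the first bullet (the coverage check itself is stubbed; only the dispatch rule is shown):

```python
# Standalone sketch of the exact-type coverage gate.
class ConfigSectionConverter:
    fast_llm_config_class: type = object

    @classmethod
    def export_config(cls, config) -> dict:
        if type(config) is cls.fast_llm_config_class:
            # Exact match: this converter must account for every architecture field.
            cls._check_architecture_coverage(config)
        # Strict subclasses fall through so super().export_config() from an
        # unmigrated subclass doesn't trip the parent's check.
        return {}

    @classmethod
    def _check_architecture_coverage(cls, config) -> None:
        pass  # stub; the real check walks FieldHint.architecture fields


class MLPConfig:
    pass


class MoEMLPConfig(MLPConfig):
    pass


class LlamaMLPConverter(ConfigSectionConverter):
    fast_llm_config_class = MLPConfig


LlamaMLPConverter.export_config(MLPConfig())     # coverage check fires
LlamaMLPConverter.export_config(MoEMLPConfig())  # deferred to a MoE-specific converter
```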
Verification
- Live round-trips of Llama-3, Qwen2, Mistral, Mixtral, and MTP-Llama HF configs (including Qwen2's derived `head_size`).
- Export fails loudly on unsupported `softmax_scale_power` and on configured PEFT.
- `pytest tests/models/test_checkpoint.py --models gpt`: 139 passed, 0 failed across llama / qwen_2 / mistral / mixtral / mtp_llama / apriel2_attn / llava / diffusion_llama.

Test plan
- `pytest -v -n 6 tests/models/test_checkpoint.py 2>&1 | tee /tmp/fast_llm_tests/pytest_out.txt`
- `pytest -v -n 6 tests/models/test_hf_roundtrip.py`
- `pytest -v -n 6 --models gpt tests/`
- `pytest -v -n 6 fast_llm_external_models/tests/` (separate invocation per CLAUDE.md)
- `fast-llm convert --input.format llama --input.path <ref> --output.format llama --output.path <tmp>`; reload both and compare configs.

What's not in this PR
Phase 2 steps 3–8 of the plan (apriel2 / mistral / qwen2 / mtp_llama / mixtral / diffusion / apriel / multimodal migrations + cleanup) and the weight-converter declarative refactor are deferred. The framework is built so they can land incrementally on top of this.
🤖 Generated with Claude Code